Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8082 cloud_io: add missing error handling to remote::download_object #24059

Conversation

pgellert
Copy link
Contributor

@pgellert pgellert commented Nov 7, 2024

The call to drain_response_stream may throw various transport related errors (see one example below of a broken pipe error observed in CI). These errors should be handled inside the remote::download_object method because the caller's expectation is that download-related errors are communicated via the download_result return type rather than through an exception. Some of these errors (like the broken pipe error below) could also be retried, whereas with the previous implementation, they were not retried.

These exceptions are often ignored by the caller and may be printed as "Exceptional future ignored" log lines, which cause CI failures and are less useful for debugging.

The below is an example of one such ignored exceptional future in the remote partition finalizing background fibre:

INFO  2024-10-29 12:41:17,708 [shard 1:main] cloud_storage - [fiber474 kafka/fuzzy-operator-6356-dzxvff/4] - remote_partition.cc:1406 - Finalizing remote storage state...
DEBUG 2024-10-29 12:41:17,723 [shard 1:main] cloud_io - [fiber819~0|1|19984ms] - remote.cc:430 - Receive OK response from "37836c6f-30b0-482f-bb4e-0f3dffdb5cbe/meta/kafka/fuzzy-operator-6356-dzxvff/1_3447/manifest.bin"
WARN  2024-10-29 12:41:17,723 [shard 1:main] http - /37836c6f-30b0-482f-bb4e-0f3dffdb5cbe/meta/kafka/fuzzy-operator-6356-dzxvff/1_3447/manifest.bin - client.cc:414 - receive error std::__1::system_error (error generic:32, System error during SSL read: [error:FFFFFFFF80000020:system library::Broken pipe]: Broken pipe)
WARN  2024-10-29 12:41:17,723 [shard 1:main] seastar - Exceptional future ignored: std::__1::system_error (error generic:32, System error during SSL read: [error:FFFFFFFF80000020:system library::Broken pipe]: Broken pipe), backtrace: 0xa73be23 0xa392e05 0x360a6b8 0x9352157 0x360a71a 0xa48cc6f 0xa49045c 0xa4e77ca 0xa402f3f /opt/redpanda/lib/libc.so.6+0x961b6 /opt/redpanda/lib/libc.so.6+0x11839b

This issue was highlighted by failures of the RandomNodeOperations test with tiered storage enabled running in CDT, which are linked below.

Fixes https://redpandadata.atlassian.net/browse/CORE-8082
Fixes https://redpandadata.atlassian.net/browse/CORE-8052
Fixes https://redpandadata.atlassian.net/browse/CORE-8018
Fixes https://redpandadata.atlassian.net/browse/CORE-8009
Fixes https://redpandadata.atlassian.net/browse/CORE-7979
Fixes https://redpandadata.atlassian.net/browse/CORE-7834
Fixes https://redpandadata.atlassian.net/browse/CORE-7767
Fixes https://redpandadata.atlassian.net/browse/CORE-7675
Fixes https://redpandadata.atlassian.net/browse/CORE-7315

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Bug Fixes

  • Fixes a rare bug during remote partition manifest downloads where broken pipe exceptions weren't retried in an edge case.

@pgellert pgellert requested review from Lazin, andrwng and a team November 7, 2024 14:33
@pgellert pgellert self-assigned this Nov 7, 2024
@pgellert pgellert requested review from IoannisRP and removed request for a team November 7, 2024 14:33
Lazin
Lazin previously approved these changes Nov 7, 2024
This is to allow passing in a `retry_chain_logger` which does not
inherit from `ss::logger` but wraps it.
The call to `drain_response_stream` may throw various transport related
errors (see one example below of a Broken Pipe error observed in CI).
These errors should be handled inside the `remote::download_object`
method because the caller's expectation is that download-related errors
are communicated via the `download_result` return type rather than
through an exception. Some of these errors (like the broken pipe error
below) could also be retried, whereas with the previous implementation
they were not retried.

These exceptions are often ignored by the caller and may be printed as
"Exceptional future ignored" log lines, which cause CI failures and are
less useful for debugging.

The below is an example of one such ignored exceptional future in the
remote partition finalizing background fibre:
```
INFO  2024-10-29 12:41:17,708 [shard 1:main] cloud_storage - [fiber474 kafka/fuzzy-operator-6356-dzxvff/4] - remote_partition.cc:1406 - Finalizing remote storage state...
DEBUG 2024-10-29 12:41:17,723 [shard 1:main] cloud_io - [fiber819~0|1|19984ms] - remote.cc:430 - Receive OK response from "37836c6f-30b0-482f-bb4e-0f3dffdb5cbe/meta/kafka/fuzzy-operator-6356-dzxvff/1_3447/manifest.bin"
WARN  2024-10-29 12:41:17,723 [shard 1:main] http - /37836c6f-30b0-482f-bb4e-0f3dffdb5cbe/meta/kafka/fuzzy-operator-6356-dzxvff/1_3447/manifest.bin - client.cc:414 - receive error std::__1::system_error (error generic:32, System error during SSL read: [error:FFFFFFFF80000020:system library::Broken pipe]: Broken pipe)
WARN  2024-10-29 12:41:17,723 [shard 1:main] seastar - Exceptional future ignored: std::__1::system_error (error generic:32, System error during SSL read: [error:FFFFFFFF80000020:system library::Broken pipe]: Broken pipe), backtrace: 0xa73be23 0xa392e05 0x360a6b8 0x9352157 0x360a71a 0xa48cc6f 0xa49045c 0xa4e77ca 0xa402f3f /opt/redpanda/lib/libc.so.6+0x961b6 /opt/redpanda/lib/libc.so.6+0x11839b
```
@pgellert pgellert force-pushed the fix/missing-error-handling-download-manifest branch from 0825419 to ad14537 Compare November 7, 2024 14:57
@pgellert
Copy link
Contributor Author

pgellert commented Nov 7, 2024

force-push: format the bazel build file to fix the linting error

@pgellert pgellert requested a review from Lazin November 7, 2024 14:58
@vbotbuildovich
Copy link
Collaborator

the below tests from https://buildkite.com/redpanda/redpanda/builds/57781#01930722-dd26-462e-b3c6-242f5ac00b6e have failed and will be retried

translator_test_rpfixture

@vbotbuildovich
Copy link
Collaborator

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/57781#0193077d-fcd8-4922-9958-3b8ff09dc64d:

"rptest.tests.enterprise_features_license_test.EnterpriseFeaturesTest.test_enable_features.feature=Feature.oidc.install_license=False.disable_trial=False"

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#57781

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/enterprise_features_license_test.py::EnterpriseFeaturesTest.test_enable_features@{"disable_trial":false,"feature":5,"install_license":false}

@@ -53,6 +53,7 @@ redpanda_cc_library(
"//src/v/utils:functional",
"//src/v/utils:log_hist",
"//src/v/utils:named_type",
"//src/v/utils:retry_chain_node",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably doesn't have material impact here but as rule of thumb this should go to implementation_deps https://redpandadata.atlassian.net/wiki/spaces/CORE/pages/714670307/Bazel

@pgellert pgellert merged commit 290ae07 into redpanda-data:dev Nov 8, 2024
18 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24059-v24.2.x-11 remotes/upstream/v24.2.x
git cherry-pick -x f3888315ad ad14537c81

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-24059-v24.1.x-193 remotes/upstream/v24.1.x
git cherry-pick -x f3888315ad ad14537c81

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants